Multi-Script Line Identification System for Indian Languages

نویسندگان

  • Prakash K. Aithal
  • Rajesh Gopakumar
  • Dinesh U. Acharya
چکیده

India is a multilingual multi-script country. There are totally 18 official languages and 12 scripts in India. For Optical Character Recognition (OCR) of such a multi-lingual document, it is necessary to identify the script before feeding the text lines to the OCRs of individual scripts. In this paper, a simple and efficient technique of script identification for Kannada, Malayalam, Telugu, Tamil, Gujarati, Hindi and English text lines from a printed document is presented. The proposed system uses horizontal projection profile, Vertical projection profile and Top pitch information to distinguish the seven scripts. The knowledge base of the system is developed based on 50 different document images containing about 250 text lines of each script. The proposed system is tested on 50 different document images containing about 250 text lines of each script and an overall classification rate of 97.64% is achieved.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Script Identification from Bilingual Gujarati-English Documents

In a multi-lingual country like India, in most of the official papers, school text books, magazines, it is observed that English words intersperse within the Indian regional languages. So a bilingual Optical Character Recognition (OCR) system is needed which can recognize these bilingual documents and store it for future use. In this paper authors present an OCR system developed for the script ...

متن کامل

Trainable Script Identification Strategies for Indian Languages

Identification of the script in an image of a document page is of primary importance for a system processing multi-lingual documents. In this paper three trainable classification schemes have been proposed for identification of Indian scripts. The first scheme is based upon a frequency domain representation of the horizontal profile of the textual blocks. The other two schemes use connected com...

متن کامل

Gabor Features Based Script Identification of Lines within a Bilingual/Trilingual Document

The OCR technology for Indian documents is in emerging stage and most of these Indian OCR systems can read the documents written in only a single script. As many commercial and official documents of different states of India are tri-lingual in nature, therefore identification of script and/ or language is one of the elementary tasks for multi-script document recognition. A script recognizer sim...

متن کامل

Multi-Script Line identification from Indian Documents

A document page may contain two or more different scripts. For Optical Character Recognition (OCR) of such a document page, it is necessary to separate different scripts before feeding them to their individual OCR system. In this paper an automatic scheme is presented to identify text lines of different Indian scripts from a document. For the separation task at first the scripts are grouped int...

متن کامل

A Multiple Feature based Novel Approach for Identification of Printed Indian Scripts at Word Level

In a country like India where different scripts are in use, automatic identification of printed script facilitates many important applications such as automatic transcription of multilingual documents and for the selection of script specific OCR in a multilingual environment. In this paper a novel method to identify the script type of the collection of documents printed in seven Indian language...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010